In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

# ## Plotly plotting support
# import plotly.plotly as py

import plotly.offline as py
py.init_notebook_mode(connected=False)

import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf

cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

Introduction

In this lecture we examine the process of data cleaning and Exploratory Data Analysis (EDA). Often you will acquire or even be given a collection of data in order to conduct some analysis or answer some questions. The first step in using that data is to ensure that it is in the correct form (cleaned) and that you understand its properties and limitations (EDA). Often as you explore data through EDA you will identify additional transformations that may be required before the data is ready for analysis.

In this notebook we obtain crime data from the city of Berkeley's public records. Ultimately, our goal might be to understand policing patterns but before we get there we must first clean and understand the data.

Getting the Data

To begin this analysis we want to get data about crimes in Berkeley. Remarkably, the city of Berkeley maintains an Open Data Portal for citizens to access data about the city. We will be examining the following datasets:

  1. Call Data
  2. Stop Data

Fortunately, this data is also relatively well documented, with detailed descriptions of what it contains. Here are summaries of the fields in the data:

Calls Data

Stop Data

Most data has bad documentation:

Unfortunately, data is seldom well documented and when it is you may not be able to trust the documentation. It is therefore critical that when we download the data we investigate the fields and verify that it reflects the assumptions made in the documentation.

Reproducible Data Science

In the interest of reproducible data science we will download the data programmatically. We have defined some helper functions in the utils.py file, which we can then reuse in many different notebooks.

In [2]:
from utils import fetch_and_cache
In [3]:
help(fetch_and_cache)
Help on function fetch_and_cache in module utils:

fetch_and_cache(data_url, file, data_dir='data', force=False)
    Download and cache a url and return the file object.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path object representing the file.

I can actually look at the source code in the notebook.

In [4]:
fetch_and_cache??
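The actual source lives in utils.py; a minimal sketch of such a helper, using only the standard library (the printed messages are assumptions modeled on the outputs shown below), might look like:

```python
import time
import urllib.request
from pathlib import Path

def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """Download data_url into data_dir/file unless a cached copy already exists."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    file_path = data_dir / Path(file)
    if force or not file_path.exists():
        print("Downloading...", end=" ")
        urllib.request.urlretrieve(data_url, file_path)
        print("Done!")
    else:
        # Report when the cached copy was last modified
        print("Using cached version downloaded:",
              time.ctime(file_path.stat().st_mtime))
    return file_path
```

Caching keeps the notebook reproducible without hammering the data portal on every re-run; passing force=True refreshes the local copy.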

Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use the Python importlib library:

from importlib import reload
reload(utils)

or use IPython magic, which will intelligently reimport code when files change:

%load_ext autoreload
%autoreload 2

Downloading the Data

Notice that because I record how I got the data in the notebook, others can reproduce this experiment. However, it is worth noting that the data can change. We will want to pay attention to file timestamps.

In [5]:
calls_file = fetch_and_cache("https://data.cityofberkeley.info/api/views/k2nh-s5h5/rows.csv?accessType=DOWNLOAD",
                "calls_for_service.csv")
Using cached version downloaded: Wed Jan 24 23:19:50 2018
In [6]:
stops_file = fetch_and_cache("https://data.cityofberkeley.info/resource/4p7k-drdw.json?$limit=100000",
                "stops.json")
Using cached version downloaded: Wed Jan 24 23:49:01 2018

To demonstrate live downloading (of something we won't need in lecture) we can download an image from Professor Gonzalez's website.

In [7]:
gonzalez_file = fetch_and_cache("https://people.eecs.berkeley.edu/~jegonzal/assets/jegonzal.jpg", 
                                "gonzalez.jpg", 
                                force=True)
Downloading... Done!

Exploring the data

Now that we have obtained the data we want to understand its:

  • Structure -- the "shape" of a data file
  • Granularity -- how fine/coarse is each datum
  • Scope -- how (in)complete is the data
  • Temporality -- how is the data situated in time
  • Faithfulness -- how well does the data capture "reality"

Structure

Before we even begin to load the data it often helps to understand a little about the high-level structure:

  1. How much data do I have?
  2. How is it formatted?

How big is the data?

I often like to start my analysis by getting a rough estimate of the size of the data. This will help inform the tools I use and how I view the data. If it is relatively small I might use a text editor or a spreadsheet to look at the data. If it is larger, I might jump to more programmatic exploration or even use distributed computing tools.
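Before opening anything, one way to sketch a quick size check with pathlib (the data/ directory name is an assumption based on the download cells above):

```python
from pathlib import Path

def sizes_mb(data_dir="data"):
    """Return a {filename: size in MB} mapping for the files in data_dir."""
    return {p.name: p.stat().st_size / 1e6
            for p in Path(data_dir).iterdir() if p.is_file()}
```

For the files above this would report sizes well under 10 MB each, comfortably in text-editor territory.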

In [8]:
type(stops_file)
Out[8]:
pathlib.PosixPath
In [9]:
help(stops_file)
Help on PosixPath in module pathlib object:

class PosixPath(Path, PurePosixPath)
 |  PurePath represents a filesystem path and offers operations which
 |  don't imply any actual filesystem I/O.  Depending on your system,
 |  instantiating a PurePath will return either a PurePosixPath or a
 |  PureWindowsPath object.  You can also instantiate either of these classes
 |  directly, regardless of your system.
 |  
 |  Method resolution order:
 |      PosixPath
 |      Path
 |      PurePosixPath
 |      PurePath
 |      builtins.object
 |  
 |  Methods inherited from Path:
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, t, v, tb)
 |  
 |  absolute(self)
 |      Return an absolute version of this path.  This function works
 |      even if the path doesn't point to anything.
 |      
 |      No normalization is done, i.e. all '.' and '..' will be kept along.
 |      Use resolve() to get the canonical path to a file.
 |  
 |  chmod(self, mode)
 |      Change the permissions of the path, like os.chmod().
 |  
 |  exists(self)
 |      Whether this path exists.
 |  
 |  expanduser(self)
 |      Return a new path with expanded ~ and ~user constructs
 |      (as returned by os.path.expanduser)
 |  
 |  glob(self, pattern)
 |      Iterate over this subtree and yield all existing files (of any
 |      kind, including directories) matching the given pattern.
 |  
 |  group(self)
 |      Return the group name of the file gid.
 |  
 |  is_block_device(self)
 |      Whether this path is a block device.
 |  
 |  is_char_device(self)
 |      Whether this path is a character device.
 |  
 |  is_dir(self)
 |      Whether this path is a directory.
 |  
 |  is_fifo(self)
 |      Whether this path is a FIFO.
 |  
 |  is_file(self)
 |      Whether this path is a regular file (also True for symlinks pointing
 |      to regular files).
 |  
 |  is_socket(self)
 |      Whether this path is a socket.
 |  
 |  is_symlink(self)
 |      Whether this path is a symbolic link.
 |  
 |  iterdir(self)
 |      Iterate over the files in this directory.  Does not yield any
 |      result for the special paths '.' and '..'.
 |  
 |  lchmod(self, mode)
 |      Like chmod(), except if the path points to a symlink, the symlink's
 |      permissions are changed, rather than its target's.
 |  
 |  lstat(self)
 |      Like stat(), except if the path points to a symlink, the symlink's
 |      status information is returned, rather than its target's.
 |  
 |  mkdir(self, mode=511, parents=False, exist_ok=False)
 |      Create a new directory at this given path.
 |  
 |  open(self, mode='r', buffering=-1, encoding=None, errors=None, newline=None)
 |      Open the file pointed by this path and return a file object, as
 |      the built-in open() function does.
 |  
 |  owner(self)
 |      Return the login name of the file owner.
 |  
 |  read_bytes(self)
 |      Open the file in bytes mode, read it, and close the file.
 |  
 |  read_text(self, encoding=None, errors=None)
 |      Open the file in text mode, read it, and close the file.
 |  
 |  rename(self, target)
 |      Rename this path to the given path.
 |  
 |  replace(self, target)
 |      Rename this path to the given path, clobbering the existing
 |      destination if it exists.
 |  
 |  resolve(self, strict=False)
 |      Make the path absolute, resolving all symlinks on the way and also
 |      normalizing it (for example turning slashes into backslashes under
 |      Windows).
 |  
 |  rglob(self, pattern)
 |      Recursively yield all existing files (of any kind, including
 |      directories) matching the given pattern, anywhere in this subtree.
 |  
 |  rmdir(self)
 |      Remove this directory.  The directory must be empty.
 |  
 |  samefile(self, other_path)
 |      Return whether other_path is the same or not as this file
 |      (as returned by os.path.samefile()).
 |  
 |  stat(self)
 |      Return the result of the stat() system call on this path, like
 |      os.stat() does.
 |  
 |  symlink_to(self, target, target_is_directory=False)
 |      Make this path a symlink pointing to the given path.
 |      Note the order of arguments (self, target) is the reverse of os.symlink's.
 |  
 |  touch(self, mode=438, exist_ok=True)
 |      Create this file with the given access mode, if it doesn't exist.
 |  
 |  unlink(self)
 |      Remove this file or link.
 |      If the path is a directory, use rmdir() instead.
 |  
 |  write_bytes(self, data)
 |      Open the file in bytes mode, write to it, and close the file.
 |  
 |  write_text(self, data, encoding=None, errors=None)
 |      Open the file in text mode, write to it, and close the file.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from Path:
 |  
 |  cwd() from builtins.type
 |      Return a new path pointing to the current working directory
 |      (as returned by os.getcwd()).
 |  
 |  home() from builtins.type
 |      Return a new path pointing to the user's home directory (as
 |      returned by os.path.expanduser('~')).
 |  
 |  ----------------------------------------------------------------------
 |  Static methods inherited from Path:
 |  
 |  __new__(cls, *args, **kwargs)
 |      Construct a PurePath from one or several strings and or existing
 |      PurePath objects.  The strings and path objects are combined so as
 |      to yield a canonicalized path, which is incorporated into the
 |      new PurePath object.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from PurePath:
 |  
 |  __bytes__(self)
 |      Return the bytes representation of the path.  This is only
 |      recommended to use under Unix.
 |  
 |  __eq__(self, other)
 |      Return self==value.
 |  
 |  __fspath__(self)
 |  
 |  __ge__(self, other)
 |      Return self>=value.
 |  
 |  __gt__(self, other)
 |      Return self>value.
 |  
 |  __hash__(self)
 |      Return hash(self).
 |  
 |  __le__(self, other)
 |      Return self<=value.
 |  
 |  __lt__(self, other)
 |      Return self<value.
 |  
 |  __reduce__(self)
 |      helper for pickle
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __rtruediv__(self, key)
 |  
 |  __str__(self)
 |      Return the string representation of the path, suitable for
 |      passing to system calls.
 |  
 |  __truediv__(self, key)
 |  
 |  as_posix(self)
 |      Return the string representation of the path with forward (/)
 |      slashes.
 |  
 |  as_uri(self)
 |      Return the path as a 'file' URI.
 |  
 |  is_absolute(self)
 |      True if the path is absolute (has both a root and, if applicable,
 |      a drive).
 |  
 |  is_reserved(self)
 |      Return True if the path contains one of the special names reserved
 |      by the system, if any.
 |  
 |  joinpath(self, *args)
 |      Combine this path with one or several arguments, and return a
 |      new path representing either a subpath (if all arguments are relative
 |      paths) or a totally different path (if one of the arguments is
 |      anchored).
 |  
 |  match(self, path_pattern)
 |      Return True if this path matches the given pattern.
 |  
 |  relative_to(self, *other)
 |      Return the relative path to another path identified by the passed
 |      arguments.  If the operation is not possible (because this is not
 |      a subpath of the other path), raise ValueError.
 |  
 |  with_name(self, name)
 |      Return a new path with the file name changed.
 |  
 |  with_suffix(self, suffix)
 |      Return a new path with the file suffix changed (or added, if none).
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from PurePath:
 |  
 |  anchor
 |      The concatenation of the drive and root, or ''.
 |  
 |  drive
 |      The drive prefix (letter or UNC path), if any.
 |  
 |  name
 |      The final path component, if any.
 |  
 |  parent
 |      The logical parent of the path.
 |  
 |  parents
 |      A sequence of this path's logical parents.
 |  
 |  parts
 |      An object providing sequence-like access to the
 |      components in the filesystem path.
 |  
 |  root
 |      The root of the path, if any.
 |  
 |  stem
 |      The final path component, minus its last suffix.
 |  
 |  suffix
 |      The final component's last suffix, if any.
 |  
 |  suffixes
 |      A list of the final component's suffixes, if any.

In [10]:
stops_file.stat()
Out[10]:
os.stat_result(st_mode=33188, st_ino=8601000273, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=6123225, st_atime=1517462196, st_mtime=1516866541, st_ctime=1516866541)
In [11]:
calls_file.stat().st_size
Out[11]:
976004

All the files are relatively small and we could comfortably examine them in a text editor. (Personally, I like Sublime or Emacs, but others may have a different view.)

In listing the files I noticed that the names suggest that they are all text file formats:

  • CSV: Comma separated values is a very standard table format.
  • JSON: JavaScript Object Notation is a very standard semi-structured file format used to store nested data.

We will dive into the formats in a moment. However, because these are text files, I might also want to investigate the number of lines, which often corresponds to the number of records.

In [12]:
from utils import line_count
help(line_count)
Help on function line_count in module utils:

line_count(file)
    Computes the number of lines in a file.
    
    file: the file in which to count the lines.
    return: The number of lines in the file

In [13]:
print("Lines in the calls file", line_count(calls_file))
Lines in the calls file 16053
In [14]:
print("Lines in the stops file", line_count(stops_file))
Lines in the stops file 29851
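The actual helper is defined in utils.py; a memory-friendly sketch of it might count newline bytes one chunk at a time rather than reading the whole file:

```python
def line_count(file):
    """Compute the number of lines in a file without loading it all into memory."""
    with open(file, "rb") as f:
        # Count newline bytes one 1 MB chunk at a time
        return sum(chunk.count(b"\n")
                   for chunk in iter(lambda: f.read(1 << 20), b""))
```

Like wc, this counts newline characters, so a final line without a trailing newline would not be counted.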

If your IPython session is running on a POSIX-compliant machine (e.g., Mac/Linux) you could also run:

!wc data/*

which would produce (lines, words, characters):

   16053   85512  976004 data/calls_for_service.csv
     148    1541   54988 data/gonzalez.jpg
   29850  658525 6123225 data/stops.json
   46051  745578 7154217 total
In [15]:
# !wc data/*

If we wanted to learn more about the command we could use the man (short for manual) command to read the manual page for wc (word count).

Additional File Meta-Data

Meta-data is data about the data. Most filesystems will maintain information about when a file was created, read, or modified. Below we use the Python time library to print the time in the local representation. Note, time is often measured in Unix Time (seconds since January 1st, 1970).

In [16]:
import time 
print("Time of creation (seconds since Epoch):", 
      stops_file.stat().st_ctime)
print("Local time of creation:", 
      time.ctime(stops_file.stat().st_ctime))
Time of creation (seconds since Epoch): 1516866541.449748
Local time of creation: Wed Jan 24 23:49:01 2018

What is the file format? (Can we trust extensions?)

We already noticed that the file names end in .csv and .json, which suggests that these are comma-separated values and JavaScript Object Notation files, respectively. However, we can't always rely on the names, as this is only a convention. For example, here we picked the name of each file when downloading based on some hints in the URL.

Often files will have incorrect extensions or no extension at all.

Let's assume that these are text files (and do not contain binary encoded data) so we can print a "few lines" to get a better understanding of the file.

In [17]:
from utils import head
help(head)
Help on function head in module utils:

head(filename, lines=5)
    Returns the first few lines of a file.
    
    filename: the name of the file to open
    lines: the number of lines to include
    
    return: A list of the first few lines from the file.
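A possible sketch of this helper (the real utils.py version may differ) using itertools.islice, which stops reading after the requested number of lines:

```python
from itertools import islice

def head(filename, lines=5):
    """Return the first few lines of a text file as a list of strings."""
    with open(filename) as f:
        return list(islice(f, lines))
```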

In [18]:
head(calls_file)
Out[18]:
['CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State\n',
 '18000273,VEHICLE STOLEN,01/01/2018 12:00:00 AM,20:30,MOTOR VEHICLE THEFT,1,01/24/2018 03:30:18 AM,"1100 PARKER ST\n',
 'Berkeley, CA\n',
 '(37.859364, -122.288914)",1100 PARKER ST,Berkeley,CA\n',
 '17092476,BURGLARY AUTO,12/12/2017 12:00:00 AM,13:30,BURGLARY - VEHICLE,2,01/24/2018 03:30:17 AM,"2300 LE CONTE AVE\n']

Note: Because the gonzalez_file is a JPEG (an image), it is not a character file format, so looking at the head of that file isn't going to work:

In [19]:
head(stops_file)
Out[19]:
['{\n',
 '  "meta" : {\n',
 '    "view" : {\n',
 '      "id" : "6e9j-pj9p",\n',
 '      "name" : "Berkeley PD - Stop Data",\n']
In [20]:
# head(gonzalez_file)

What are some observations about Calls data?

  1. It appears to be in comma separated value (CSV) format.
  2. First line contains the column headings.
  3. There are lots of new-line \n characters:
    • at the ends of lines (delimiting records?)
    • within records as part of addresses.
  4. There are "quoted" strings in the Block_Location column:
    "2500 LE CONTE AVE
    Berkeley, CA
    (37.876965, -122.260544)"
    these are going to be difficult to parse. What are the implications for our earlier line count calculations?
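We can see how quoting interacts with line counts using Python's built-in csv module on a tiny synthetic example:

```python
import csv
import io

# A miniature version of the calls file: one quoted field spans two lines
raw = 'CASENO,Block_Location\n1,"1100 PARKER ST\nBerkeley, CA"\n'

rows = list(csv.reader(io.StringIO(raw)))

print(raw.count("\n"))   # 3 physical lines...
print(len(rows))         # ...but only 2 logical CSV records
```

So counting newlines overestimates the number of records whenever quoted fields contain embedded newlines, which is exactly the situation in Block_Location.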
In [21]:
head(calls_file)
Out[21]:
['CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State\n',
 '18000273,VEHICLE STOLEN,01/01/2018 12:00:00 AM,20:30,MOTOR VEHICLE THEFT,1,01/24/2018 03:30:18 AM,"1100 PARKER ST\n',
 'Berkeley, CA\n',
 '(37.859364, -122.288914)",1100 PARKER ST,Berkeley,CA\n',
 '17092476,BURGLARY AUTO,12/12/2017 12:00:00 AM,13:30,BURGLARY - VEHICLE,2,01/24/2018 03:30:17 AM,"2300 LE CONTE AVE\n']

What are some observations about Stops data?

This appears to be a fairly standard JSON file. We notice that the file appears to contain a description of itself in a field called "meta" (which is presumably short for meta-data). We will come back to this meta-data in a moment, but first let's quickly discuss the JSON file format.

In [22]:
head(stops_file, lines=20)
Out[22]:
['{\n',
 '  "meta" : {\n',
 '    "view" : {\n',
 '      "id" : "6e9j-pj9p",\n',
 '      "name" : "Berkeley PD - Stop Data",\n',
 '      "attribution" : "Berkeley Police Department",\n',
 '      "averageRating" : 0,\n',
 '      "category" : "Public Safety",\n',
 '      "createdAt" : 1444171604,\n',
 '      "description" : "This data was extracted from the Department’s Public Safety Server and covers the data beginning January 26, 2015.  On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014).  Under that order, officers were required to provide certain data after making all vehicle detentions (including bicycles) and pedestrian detentions (up to five persons).  This data set lists stops by police in the categories of traffic, suspicious vehicle, pedestrian and bicycle stops.  Incident number, date and time, location and disposition codes are also listed in this data.\\r\\n\\r\\nAddress data has been changed from a specific address, where applicable, and listed as the block where the incident occurred.  Disposition codes were entered by officers who made the stop.  These codes included the person(s) race, gender, age (range), reason for the stop, enforcement action taken, and whether or not a search was conducted.\\r\\n\\r\\nThe officers of the Berkeley Police Department are prohibited from biased based policing, which is defined as any police-initiated action that relies on the race, ethnicity, or national origin rather than the behavior of an individual or information that leads the police to a particular individual who has been identified as being engaged in criminal activity.",\n',
 '      "displayType" : "table",\n',
 '      "downloadCount" : 417,\n',
 '      "hideFromCatalog" : false,\n',
 '      "hideFromDataJson" : false,\n',
 '      "indexUpdatedAt" : 1496270706,\n',
 '      "newBackend" : false,\n',
 '      "numberOfComments" : 0,\n',
 '      "oid" : 26097722,\n',
 '      "provenance" : "official",\n',
 '      "publicationAppendEnabled" : false,\n']

A quick note on JSON

JSON (JavaScript Object Notation) is a common format for exchanging complex structured and semi-structured data.

{
    "field1": "value1",
    "field2": ["list", "of", "values"],
    "myfield3": {"is_recursive": true, "a null value": null}
}

A few key points:

  • JSON is a recursive format in that JSON fields can also contain JSON objects
  • JSON closely matches Python Dictionaries:
    d = {
      "field1": "value1",
      "field2": ["list", "of", "values"],
      "myfield3": {"is_recursive": True, "a null value": None}
    }
    print(d['myfield3'])
    
  • Very common in web technologies (... JavaScript)
  • Many languages have tools for loading and saving JSON objects
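In Python, for example, the standard json module round-trips such objects between text and dictionaries:

```python
import json

d = {
    "field1": "value1",
    "field2": ["list", "of", "values"],
    "myfield3": {"is_recursive": True, "a null value": None},
}

s = json.dumps(d)          # Python dict -> JSON text
restored = json.loads(s)   # JSON text -> Python dict

assert restored == d
print(restored["myfield3"])
```

Note how Python's True and None map to JSON's true and null and back.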

Loading the Data

We will now attempt to load the data into Python. We will be using the Pandas DataFrame library for basic tabular data analysis. Fortunately, the Pandas library has some relatively sophisticated functions for loading data.

Loading the Calls Data

Because the file appears to be a relatively well formatted CSV we will attempt to load it directly and allow the Pandas Library to deduce column headers. (Always check that first row and column look correct after loading.)

In [23]:
calls = pd.read_csv(calls_file)
calls.head()
Out[23]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA
1 17092476 BURGLARY AUTO 12/12/2017 12:00:00 AM 13:30 BURGLARY - VEHICLE 2 01/24/2018 03:30:17 AM 2300 LE CONTE AVE\nBerkeley, CA\n(37.874867, -... 2300 LE CONTE AVE Berkeley CA
2 17092534 BURGLARY AUTO 12/20/2017 12:00:00 AM 05:00 BURGLARY - VEHICLE 3 01/24/2018 03:30:17 AM 1700 STUART ST\nBerkeley, CA\n(37.857495, -122... 1700 STUART ST Berkeley CA
3 17091517 THEFT MISD. (UNDER $950) 08/01/2017 12:00:00 AM 00:30 LARCENY 2 01/24/2018 03:30:11 AM 1600 CALIFORNIA ST\nBerkeley, CA\n(37.876791, ... 1600 CALIFORNIA ST Berkeley CA
4 17048102 THEFT FROM AUTO 08/13/2017 12:00:00 AM 00:40 LARCENY - FROM VEHICLE 0 01/24/2018 03:30:11 AM 2400 HILLSIDE AVE\nBerkeley, CA\n(37.867373, -... 2400 HILLSIDE AVE Berkeley CA

How many records did we get? Note that DataFrame.size counts cells (rows × columns), so with 11 columns the output below corresponds to 58916 / 11 = 5356 records; len(calls) would report the row count directly.

In [24]:
calls.size
Out[24]:
58916

Preliminary observations on the data?

  1. EVENTDT -- Contains the date, but the time stamp is always 12:00:00 AM (incorrect).
  2. EVENTTM -- Contains the time in 24 hour format (What timezone?)
  3. CVDOW -- Appears to be some encoding of the day of the week (see data documentation).
  4. InDbDate -- Appears to be correctly formatted and appears pretty consistent in time.
  5. Block_Location -- Errr, what a mess! Address, newline characters, and geo-coordinates all merged together! Fortunately, this field was "quoted"; otherwise we would have had trouble parsing the file. (Why?)
  6. BLKADDR -- This appears to be the address in Block Location.
  7. City and State seem redundant given this is supposed to be the city of Berkeley dataset.

Checking that the City and State fields are redundant

We notice that there are city and state columns. Since this is supposed to be data for the city of Berkeley, these columns appear to be redundant. Let's quickly compute the number of occurrences of unique values for these two columns.

In [25]:
calls.groupby(["City", "State"]).count()
Out[25]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR
City State
Berkeley CA 5356 5356 5356 5356 5356 5356 5356 5356 5340

Decoding day of the week

According to the documentation, CVDOW=0 is Sunday, CVDOW=1 is Monday, and so on. Therefore, we can make a series to decode the day of the week for each record and join that series with the calls data.

In [26]:
dow = pd.Series(["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"], name="DoW")
dow
Out[26]:
0       Sunday
1       Monday
2      Tuesday
3    Wednesday
4     Thursday
5       Friday
6     Saturday
Name: DoW, dtype: object
In [27]:
calls['Day'] = dow[calls['CVDOW']].reset_index(drop=True)
calls.head()
Out[27]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA Monday
1 17092476 BURGLARY AUTO 12/12/2017 12:00:00 AM 13:30 BURGLARY - VEHICLE 2 01/24/2018 03:30:17 AM 2300 LE CONTE AVE\nBerkeley, CA\n(37.874867, -... 2300 LE CONTE AVE Berkeley CA Tuesday
2 17092534 BURGLARY AUTO 12/20/2017 12:00:00 AM 05:00 BURGLARY - VEHICLE 3 01/24/2018 03:30:17 AM 1700 STUART ST\nBerkeley, CA\n(37.857495, -122... 1700 STUART ST Berkeley CA Wednesday
3 17091517 THEFT MISD. (UNDER $950) 08/01/2017 12:00:00 AM 00:30 LARCENY 2 01/24/2018 03:30:11 AM 1600 CALIFORNIA ST\nBerkeley, CA\n(37.876791, ... 1600 CALIFORNIA ST Berkeley CA Tuesday
4 17048102 THEFT FROM AUTO 08/13/2017 12:00:00 AM 00:40 LARCENY - FROM VEHICLE 0 01/24/2018 03:30:11 AM 2400 HILLSIDE AVE\nBerkeley, CA\n(37.867373, -... 2400 HILLSIDE AVE Berkeley CA Sunday

We can also accomplish the same task using a join.

In [28]:
df_dow = pd.DataFrame(dow)
pd.merge(calls, df_dow, left_on='CVDOW', right_index=True).head()
Out[28]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day DoW
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA Monday Monday
16 17091876 BURGLARY AUTO 09/18/2017 12:00:00 AM 20:00 BURGLARY - VEHICLE 1 01/24/2018 03:30:13 AM 2000 BERKELEY WAY\nBerkeley, CA\n(37.872691, -... 2000 BERKELEY WAY Berkeley CA Monday Monday
26 17047728 VEHICLE STOLEN 08/14/2017 12:00:00 AM 07:50 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:11 AM 1700 FRANKLIN ST\nBerkeley, CA\n(37.874427, -1... 1700 FRANKLIN ST Berkeley CA Monday Monday
35 17091546 VANDALISM 07/31/2017 12:00:00 AM 21:00 VANDALISM 1 01/24/2018 03:30:11 AM 800 DELAWARE ST\nBerkeley, CA\n(37.870393, -12... 800 DELAWARE ST Berkeley CA Monday Monday
39 17073608 ASSAULT/BATTERY FEL. 12/04/2017 12:00:00 AM 09:10 ASSAULT 1 01/24/2018 03:30:17 AM 2400 SAN PABLO AVE\nBerkeley, CA\n(37.862542, ... 2400 SAN PABLO AVE Berkeley CA Monday Monday

Cleaning Block Location

The block location contains the GPS coordinates and I would like to use these to analyze the location of each request. Let's try to extract the GPS coordinates using regular expressions (we will cover regular expressions in future lectures):

In [29]:
calls['Block_Location'][0]
Out[29]:
'1100 PARKER ST\nBerkeley, CA\n(37.859364, -122.288914)'
In [30]:
calls_lat_lon = (
    # Remove newlines
    calls['Block_Location'].str.replace("\n", "\t") 
    # Extract Lat and Lon using regular expression
    .str.extract(r".*\((?P<Lat>\d*\.\d*), (?P<Lon>-?\d*\.\d*)\)", expand=True)
)
calls_lat_lon.head(10)
Out[30]:
Lat Lon
0 37.859364 -122.288914
1 37.874867 -122.263689
2 37.857495 -122.275256
3 37.876791 -122.280472
4 37.867373 -122.249515
5 37.882217 -122.302226
6 37.858214 -122.269252
7 37.871828 -122.270516
8 37.88427 -122.276737
9 37.850424 -122.296031

Double check that all of the records have a latitude and longitude:

In [31]:
calls_lat_lon.isnull().sum()
Out[31]:
Lat    250
Lon    250
dtype: int64

The following block of code joins the extracted Latitude and Longitude fields with the calls data. Notice that we actually drop these fields before joining. This is to enable repeated invocation of this cell even after the join has been completed.

In [32]:
# Remove Lat and Lon if they already existed before (reproducible)
calls.drop(["Lat", "Lon"], axis=1, inplace=True, errors="ignore")
# Join in the latitude and longitude data
calls = calls.merge(calls_lat_lon, left_index=True, right_index=True)
# calls[["Lat", "Lon"]] = calls_lat_lon
# calls.join(calls_lat_lon)
calls.head()
Out[32]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day Lat Lon
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA Monday 37.859364 -122.288914
1 17092476 BURGLARY AUTO 12/12/2017 12:00:00 AM 13:30 BURGLARY - VEHICLE 2 01/24/2018 03:30:17 AM 2300 LE CONTE AVE\nBerkeley, CA\n(37.874867, -... 2300 LE CONTE AVE Berkeley CA Tuesday 37.874867 -122.263689
2 17092534 BURGLARY AUTO 12/20/2017 12:00:00 AM 05:00 BURGLARY - VEHICLE 3 01/24/2018 03:30:17 AM 1700 STUART ST\nBerkeley, CA\n(37.857495, -122... 1700 STUART ST Berkeley CA Wednesday 37.857495 -122.275256
3 17091517 THEFT MISD. (UNDER $950) 08/01/2017 12:00:00 AM 00:30 LARCENY 2 01/24/2018 03:30:11 AM 1600 CALIFORNIA ST\nBerkeley, CA\n(37.876791, ... 1600 CALIFORNIA ST Berkeley CA Tuesday 37.876791 -122.280472
4 17048102 THEFT FROM AUTO 08/13/2017 12:00:00 AM 00:40 LARCENY - FROM VEHICLE 0 01/24/2018 03:30:11 AM 2400 HILLSIDE AVE\nBerkeley, CA\n(37.867373, -... 2400 HILLSIDE AVE Berkeley CA Sunday 37.867373 -122.249515

We can now look at a few of the records that were missing latitude and longitude entries:

In [33]:
calls[calls['Lat'].isnull()].head()
Out[33]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day Lat Lon
65 17067615 THEFT MISD. (UNDER $950) 11/06/2017 12:00:00 AM 11:25 LARCENY 1 01/24/2018 03:30:16 AM 200 SHATTUCK SQ\nBerkeley, CA\n 200 SHATTUCK SQ Berkeley CA Monday NaN NaN
102 17092153 THEFT MISD. (UNDER $950) 10/27/2017 12:00:00 AM 22:11 LARCENY 5 01/24/2018 03:30:16 AM 2121 DWIGHT 417\nBerkeley, CA\n 2121 DWIGHT 417 Berkeley CA Friday NaN NaN
119 17092279 THEFT MISD. (UNDER $950) 11/13/2017 12:00:00 AM 02:00 LARCENY 1 01/24/2018 03:30:17 AM 300 DOMINGO AVE\nBerkeley, CA\n 300 DOMINGO AVE Berkeley CA Monday NaN NaN
133 17047850 BURGLARY AUTO 08/14/2017 12:00:00 AM 18:30 BURGLARY - VEHICLE 1 01/24/2018 03:30:11 AM 500 WHITAKER AVE\nBerkeley, CA\n 500 WHITAKER AVE Berkeley CA Monday NaN NaN
197 18001775 NARCOTICS 01/10/2018 12:00:00 AM 02:42 DRUG VIOLATION 3 01/24/2018 03:30:18 AM FRANCISCO STREET &amp; M L KING JR WAY\nBerkel... FRANCISCO STREET & M L KING JR WAY Berkeley CA Wednesday NaN NaN




Loading the stops.json Data

Python has relatively good support for JSON data since JSON closely matches the internal Python object model. In the following cell we load the entire JSON data file into a Python dictionary.

In [34]:
import json

with open("data/stops.json", "rb") as f:
    stops_json = json.load(f)

The stops_json variable is now a dictionary encoding the data in the file:

In [35]:
type(stops_json)
Out[35]:
dict




We can now list the keys of the top-level JSON object to determine what data it stores.

In [36]:
stops_json.keys()
Out[36]:
dict_keys(['meta', 'data'])

Observation

The JSON dictionary contains a meta key, which likely refers to meta-data (data about the data). Meta-data is often maintained with the data and can be a good source of additional information.





Digging into Meta Data

We can investigate the meta data further by examining the keys associated with the metadata.

In [37]:
stops_json['meta'].keys()
Out[37]:
dict_keys(['view'])

The meta key contains another dictionary called view. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views as we study SQL later in the class.

In [38]:
stops_json['meta']['view'].keys()
Out[38]:
dict_keys(['id', 'name', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

Notice that this is a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:

stops_json
|-> meta
|   |-> view
|       |-> id
|       |-> name
|       |-> attribution
|       ...
|-> data
    | ... (haven't explored yet)
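Rather than exploring key by key, a small recursive helper can print this key tree automatically. A minimal sketch (the helper name and depth limit are my own choices, not part of the dataset):

```python
def print_key_tree(obj, depth=0, max_depth=2):
    """Recursively print the keys of a nested dictionary, up to max_depth."""
    if not isinstance(obj, dict) or depth > max_depth:
        return
    for key in obj:
        print("    " * depth + "|-> " + str(key))
        print_key_tree(obj[key], depth + 1, max_depth)

# Toy structure shaped like stops_json; in the notebook you would
# call print_key_tree(stops_json) instead.
toy = {"meta": {"view": {"id": 1, "name": "stops"}}, "data": [[1, 2]]}
print_key_tree(toy)
```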

There is a key called description in the view sub-dictionary. This likely contains a description of the data:

In [39]:
print(stops_json['meta']['view']['description'])
This data was extracted from the Department’s Public Safety Server and covers the data beginning January 26, 2015.  On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014).  Under that order, officers were required to provide certain data after making all vehicle detentions (including bicycles) and pedestrian detentions (up to five persons).  This data set lists stops by police in the categories of traffic, suspicious vehicle, pedestrian and bicycle stops.  Incident number, date and time, location and disposition codes are also listed in this data.

Address data has been changed from a specific address, where applicable, and listed as the block where the incident occurred.  Disposition codes were entered by officers who made the stop.  These codes included the person(s) race, gender, age (range), reason for the stop, enforcement action taken, and whether or not a search was conducted.

The officers of the Berkeley Police Department are prohibited from biased based policing, which is defined as any police-initiated action that relies on the race, ethnicity, or national origin rather than the behavior of an individual or information that leads the police to a particular individual who has been identified as being engaged in criminal activity.




Column Metadata

Another potentially useful key in the metadata dictionary is columns, which contains a list:

In [40]:
type(stops_json['meta']['view']['columns'])
Out[40]:
list

We can browse the list using Python:

In [41]:
for c in stops_json['meta']['view']['columns']:
    print(c["name"], "----------->")
    if "description" in c:
        print(c["description"])
    print("======================================\n")
sid ----------->
======================================

id ----------->
======================================

position ----------->
======================================

created_at ----------->
======================================

created_meta ----------->
======================================

updated_at ----------->
======================================

updated_meta ----------->
======================================

meta ----------->
======================================

Incident Number ----------->
Number of incident created by Computer Aided Dispatch (CAD) program
======================================

Call Date/Time ----------->
Date and time of the incident/stop
======================================

Location ----------->
General location of the incident/stop
======================================

Incident Type ----------->
This is the occurred incident type created in the CAD program.  A code signifies a traffic stop (T), suspicious vehicle stop (1196), pedestrian stop (1194) and bicycle stop (1194B).
======================================

Dispositions ----------->
Ordered in the following sequence:
1st Character = Race, as follows:
A (Asian) B (Black) H (Hispanic) O (Other) W (White)
2nd Character = Gender, as follows:
F (Female) M (Male)
3rd Character = Age Range, as follows:
1 (Less than 18) 2 (18-29) 3 (30-39), 4 (Greater than 40)
4th Character = Reason, as follows:
I (Investigation) T (Traffic) R (Reasonable Suspicion)
K (Probation/Parole) W (Wanted)
5th Character = Enforcement, as follows:
A (Arrest) C (Citation) O (Other) W (Warning)
6th Character = Car Search, as follows:
S (Search) N (No Search)

Additional dispositions may also appear.  They are:
P - Primary case report
M - MDT narrative only
AR - Arrest report only (no case report submitted)
IN - Incident report
FC - Field Card
CO - Collision investigation report
MH - Emergency Psychiatric Evaluation
TOW - Impounded vehicle
0 or 00000 – Officer made a stop of more than five persons
======================================

Location - Latitude ----------->
General latitude of the call.  This data is only uploaded after January 2017
======================================

Location - Longitude ----------->
General longitude of the call.  This data is only uploaded after January 2017.
======================================
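The six-character disposition encoding described above can be decoded mechanically. A minimal sketch of a hypothetical parser (the lookup tables follow the documented encoding, but the helper and its field names are my own, not part of the dataset):

```python
# Hypothetical decoder for the 6-character disposition codes documented above.
RACE = {"A": "Asian", "B": "Black", "H": "Hispanic", "O": "Other", "W": "White"}
GENDER = {"F": "Female", "M": "Male"}
AGE = {"1": "<18", "2": "18-29", "3": "30-39", "4": ">40"}
REASON = {"I": "Investigation", "T": "Traffic", "R": "Reasonable Suspicion",
          "K": "Probation/Parole", "W": "Wanted"}
ENFORCEMENT = {"A": "Arrest", "C": "Citation", "O": "Other", "W": "Warning"}
SEARCH = {"S": "Search", "N": "No Search"}

def parse_disposition(code):
    """Decode the first six characters of a disposition string.

    Returns None for codes that do not follow the documented pattern
    (e.g. the additional dispositions 'M', 'P', '00000', 'TOW')."""
    if len(code) < 6:
        return None
    fields = [RACE, GENDER, AGE, REASON, ENFORCEMENT, SEARCH]
    decoded = [table.get(ch) for table, ch in zip(fields, code[:6])]
    if None in decoded:
        return None
    keys = ["race", "gender", "age", "reason", "enforcement", "search"]
    return dict(zip(keys, decoded))

parse_disposition("BM4ICN")
# {'race': 'Black', 'gender': 'Male', 'age': '>40', 'reason': 'Investigation',
#  'enforcement': 'Citation', 'search': 'No Search'}
```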

Observations?

  1. The above metadata tells us a lot about the columns in the data, including column names and even descriptions. This information will be useful in loading and working with the data.
  2. JSON makes it easier (than CSV) to create self-documenting data.
  3. Self-documenting data can be helpful since it maintains its own description, and these descriptions are more likely to be updated as the data changes.

Examining the Data Field

We can look at a few entries in the data field

In [42]:
stops_json['data'][0:2]
Out[42]:
[[1,
  '29A1B912-A0A9-4431-ADC9-FB375809C32E',
  1,
  1444146408,
  '932858',
  1444146408,
  '932858',
  None,
  '2015-00004825',
  '2015-01-26T00:10:00',
  'SAN PABLO AVE / MARIN AVE',
  'T',
  'M',
  None,
  None],
 [2,
  '1644D161-1113-4C4F-BB2E-BF780E7AE73E',
  2,
  1444146408,
  '932858',
  1444146408,
  '932858',
  None,
  '2015-00004829',
  '2015-01-26T00:50:00',
  'SAN PABLO AVE / CHANNING WAY',
  'T',
  'M',
  None,
  None]]

Building a Dataframe from JSON

In the following block of code we:

  1. Translate the JSON records into a dataframe
  2. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
  3. Examine the top of the table
In [43]:
# Load the data from JSON and assign column titles
stops = pd.DataFrame(
    stops_json['data'],
    columns=[c['name'] for c in stops_json['meta']['view']['columns']])

stops.head()
Out[43]:
sid id position created_at created_meta updated_at updated_meta meta Incident Number Call Date/Time Location Incident Type Dispositions Location - Latitude Location - Longitude
0 1 29A1B912-A0A9-4431-ADC9-FB375809C32E 1 1444146408 932858 1444146408 932858 None 2015-00004825 2015-01-26T00:10:00 SAN PABLO AVE / MARIN AVE T M None None
1 2 1644D161-1113-4C4F-BB2E-BF780E7AE73E 2 1444146408 932858 1444146408 932858 None 2015-00004829 2015-01-26T00:50:00 SAN PABLO AVE / CHANNING WAY T M None None
2 3 5338ABAB-1C96-488D-B55F-6A47AC505872 3 1444146408 932858 1444146408 932858 None 2015-00004831 2015-01-26T01:03:00 UNIVERSITY AVE / NINTH ST T M None None
3 4 21B6CBE4-9865-460F-97BC-6B26C6EF2FDB 4 1444146408 932858 1444146408 932858 None 2015-00004848 2015-01-26T07:16:00 2000 BLOCK BERKELEY WAY 1194 BM4ICN None None
4 5 0D85FA92-80E9-48C2-B409-C3270251CD12 5 1444146408 932858 1444146408 932858 None 2015-00004849 2015-01-26T07:43:00 1700 BLOCK SAN PABLO AVE 1194 BM4ICN None None
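Step 2 above (removing columns without a metadata description) can be sketched as a standalone helper applied to the columns metadata list; the helper name and the toy metadata below are my own, for illustration:

```python
def described_columns(columns_meta):
    """Return the names of columns that carry a 'description' entry."""
    return [c["name"] for c in columns_meta if "description" in c]

# Toy metadata shaped like stops_json['meta']['view']['columns']
toy_columns = [
    {"name": "sid"},  # no description -> dropped
    {"name": "Incident Number", "description": "CAD incident number"},
    {"name": "Location", "description": "General location of the stop"},
]
described_columns(toy_columns)
# ['Incident Number', 'Location']
```

In the notebook this would be applied as `stops = stops[described_columns(stops_json['meta']['view']['columns'])]`.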

Preliminary Observations

What do we observe so far?

We observe:

  1. The Incident Number appears to have the year encoded in it.
  2. The Call Date/Time field appears to be in ISO 8601 format (YYYY-MM-DDTHH:MM:SS); the T is the standard separator between the date and the time.
  3. Incident Type has some unusual coding. This is defined in the column metadata: This is the occurred incident type created in the CAD program. A code signifies a traffic stop (T), suspicious vehicle stop (1196), pedestrian stop (1194) and bicycle stop (1194B).

Recall the description:

Stop Data





EDA

Now that we have loaded our various data files, let's try to understand a bit more about the data by examining the properties of individual fields.

EDA on the Calls Data

In [44]:
calls.head()
Out[44]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day Lat Lon
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA Monday 37.859364 -122.288914
1 17092476 BURGLARY AUTO 12/12/2017 12:00:00 AM 13:30 BURGLARY - VEHICLE 2 01/24/2018 03:30:17 AM 2300 LE CONTE AVE\nBerkeley, CA\n(37.874867, -... 2300 LE CONTE AVE Berkeley CA Tuesday 37.874867 -122.263689
2 17092534 BURGLARY AUTO 12/20/2017 12:00:00 AM 05:00 BURGLARY - VEHICLE 3 01/24/2018 03:30:17 AM 1700 STUART ST\nBerkeley, CA\n(37.857495, -122... 1700 STUART ST Berkeley CA Wednesday 37.857495 -122.275256
3 17091517 THEFT MISD. (UNDER $950) 08/01/2017 12:00:00 AM 00:30 LARCENY 2 01/24/2018 03:30:11 AM 1600 CALIFORNIA ST\nBerkeley, CA\n(37.876791, ... 1600 CALIFORNIA ST Berkeley CA Tuesday 37.876791 -122.280472
4 17048102 THEFT FROM AUTO 08/13/2017 12:00:00 AM 00:40 LARCENY - FROM VEHICLE 0 01/24/2018 03:30:11 AM 2400 HILLSIDE AVE\nBerkeley, CA\n(37.867373, -... 2400 HILLSIDE AVE Berkeley CA Sunday 37.867373 -122.249515




Checking that City and State are Redundant

We notice that there are city and state columns. Since this is supposed to be data for the city of Berkeley, these columns appear to be redundant. Let's quickly compute the number of occurrences of unique values for these two columns.

In [45]:
calls.groupby(["City", "State"]).count()
Out[45]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR Day Lat Lon
City State
Berkeley CA 5356 5356 5356 5356 5356 5356 5356 5356 5340 5356 5106 5106




Are Case Numbers unique?

Case numbers are probably used internally to track individual cases and may reference other data we don't have access to. However, it is possible that multiple calls could be associated with the same case. Let's see if the case numbers are all unique.

In [46]:
print("There are", len(calls['CASENO'].unique()), "unique case numbers.")
print("There are", len(calls), "calls in the table.")
There are 5356 unique case numbers.
There are 5356 calls in the table.

Are case numbers assigned consecutively?

In [47]:
calls['CASENO'].sort_values().reset_index(drop=True).plot()
plt.xlabel("Ordered Location in Data")
plt.ylabel("Case Number")
Out[47]:
Text(0,0.5,'Case Number')

I like to use interactive plotting tools so I can hover the mouse over the plot and read the values. The cufflinks library adds plotly support to Pandas.

In [48]:
calls['CASENO'].sort_values().reset_index(drop=True).iplot(
    layout = dict(yaxis=dict(title="Case Number", hoverformat="d"), 
                  xaxis=dict(title="Ordered Location in Data")))
Out[48]:

Examining the distribution of case numbers shows a similar pattern

In [49]:
calls['CASENO'].iplot(kind="hist",bins=100)
Out[49]:

What might we be observing?

One possible explanation is that case numbers were assigned consecutively and then sampled uniformly at different rates for two different periods. We will be able to understand this better by looking at the dates on the cases.





Examining the Date

Given the weird behavior with the case numbers, let's dig into the dates on which events were recorded. Notice that this data has several pieces of date/time information (this is not uncommon):

  1. EVENTDT: This contains the date the event took place. While it has time information the time appears to be 00:00:00.
  2. EVENTTM: This contains the time at which the event took place.
  3. InDbDate: This appears to be the date at which the data was entered in the database.
In [50]:
calls.head(3)
Out[50]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day Lat Lon
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA Monday 37.859364 -122.288914
1 17092476 BURGLARY AUTO 12/12/2017 12:00:00 AM 13:30 BURGLARY - VEHICLE 2 01/24/2018 03:30:17 AM 2300 LE CONTE AVE\nBerkeley, CA\n(37.874867, -... 2300 LE CONTE AVE Berkeley CA Tuesday 37.874867 -122.263689
2 17092534 BURGLARY AUTO 12/20/2017 12:00:00 AM 05:00 BURGLARY - VEHICLE 3 01/24/2018 03:30:17 AM 1700 STUART ST\nBerkeley, CA\n(37.857495, -122... 1700 STUART ST Berkeley CA Wednesday 37.857495 -122.275256

When Pandas loads more complex fields like dates it will often load them as strings:

In [51]:
calls["EVENTDT"][0]
Out[51]:
'01/01/2018 12:00:00 AM'

We will want to convert these to dates. Pandas has a fairly sophisticated function pd.to_datetime which is capable of guessing reasonable conversions of dates to date objects.

In [52]:
dates = pd.to_datetime(calls["EVENTDT"])
dates[0]
Out[52]:
Timestamp('2018-01-01 00:00:00')

We can verify that the translations worked by looking at a few dates:

In [53]:
pd.DataFrame(dict(transformed=dates, original=calls["EVENTDT"])).head()
Out[53]:
original transformed
0 01/01/2018 12:00:00 AM 2018-01-01
1 12/12/2017 12:00:00 AM 2017-12-12
2 12/20/2017 12:00:00 AM 2017-12-20
3 08/01/2017 12:00:00 AM 2017-08-01
4 08/13/2017 12:00:00 AM 2017-08-13

We can also extract the time field:

In [54]:
times = pd.to_datetime(calls["EVENTTM"]).dt.time
times.head()
Out[54]:
0    20:30:00
1    13:30:00
2    05:00:00
3    00:30:00
4    00:40:00
Name: EVENTTM, dtype: object

To combine the correct date and correct time field we use the built-in python datetime combine function.

In [55]:
from datetime import datetime
timestamps = pd.to_datetime(pd.concat([dates, times], axis=1).apply(
    lambda r: datetime.combine(r['EVENTDT'], r['EVENTTM']), axis=1))
timestamps.head()
Out[55]:
0   2018-01-01 20:30:00
1   2017-12-12 13:30:00
2   2017-12-20 05:00:00
3   2017-08-01 00:30:00
4   2017-08-13 00:40:00
dtype: datetime64[ns]
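An alternative sketch (not the approach used above) is to combine the raw strings and let pd.to_datetime parse the result in a single vectorized pass, avoiding the row-by-row apply. The toy Series below stand in for calls['EVENTDT'] and calls['EVENTTM'], assuming the formats shown earlier:

```python
import pandas as pd

# Toy versions of the EVENTDT and EVENTTM columns
eventdt = pd.Series(["01/01/2018 12:00:00 AM", "12/12/2017 12:00:00 AM"])
eventtm = pd.Series(["20:30", "13:30"])

# Keep only the date portion of EVENTDT and append the time string
timestamps = pd.to_datetime(eventdt.str.split(" ").str[0] + " " + eventtm)
timestamps
# 0   2018-01-01 20:30:00
# 1   2017-12-12 13:30:00
# dtype: datetime64[ns]
```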

We now update calls to contain this additional information:

In [56]:
calls['timestamp'] = timestamps
calls.head()
Out[56]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day Lat Lon timestamp
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA Monday 37.859364 -122.288914 2018-01-01 20:30:00
1 17092476 BURGLARY AUTO 12/12/2017 12:00:00 AM 13:30 BURGLARY - VEHICLE 2 01/24/2018 03:30:17 AM 2300 LE CONTE AVE\nBerkeley, CA\n(37.874867, -... 2300 LE CONTE AVE Berkeley CA Tuesday 37.874867 -122.263689 2017-12-12 13:30:00
2 17092534 BURGLARY AUTO 12/20/2017 12:00:00 AM 05:00 BURGLARY - VEHICLE 3 01/24/2018 03:30:17 AM 1700 STUART ST\nBerkeley, CA\n(37.857495, -122... 1700 STUART ST Berkeley CA Wednesday 37.857495 -122.275256 2017-12-20 05:00:00
3 17091517 THEFT MISD. (UNDER $950) 08/01/2017 12:00:00 AM 00:30 LARCENY 2 01/24/2018 03:30:11 AM 1600 CALIFORNIA ST\nBerkeley, CA\n(37.876791, ... 1600 CALIFORNIA ST Berkeley CA Tuesday 37.876791 -122.280472 2017-08-01 00:30:00
4 17048102 THEFT FROM AUTO 08/13/2017 12:00:00 AM 00:40 LARCENY - FROM VEHICLE 0 01/24/2018 03:30:11 AM 2400 HILLSIDE AVE\nBerkeley, CA\n(37.867373, -... 2400 HILLSIDE AVE Berkeley CA Sunday 37.867373 -122.249515 2017-08-13 00:40:00

What time range does the data represent?

In [57]:
calls['timestamp'].min()
Out[57]:
Timestamp('2017-07-18 02:49:00')
In [58]:
calls['timestamp'].max()
Out[58]:
Timestamp('2018-01-12 08:13:00')




Back to the Case Numbers

In [59]:
(
    calls
        .sort_values('timestamp')
        .iplot(
            mode='markers', size=5,
            x='timestamp', y='CASENO',
            layout = dict(yaxis=dict(title="Case Number", hoverformat="d"), 
                  xaxis=dict(title="Date of Call")))
)
Out[59]:

Explanation?

Perhaps there are multiple different record books with different numbering schemes? This might be something worth investigating further.

A Negative Result...

In the above analysis of case numbers all we can really conclude is that we don't understand how this variable was created. The patterns in the values suggest that there may be some sampling and it is possibly worth contacting the data provider to learn more.





Are there any other interesting temporal patterns?

Do more calls occur on a particular day of the week?

In [60]:
dow = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
calls.groupby('Day')['CASENO'].count()[dow].iplot(kind='bar', yTitle="Count")
Out[60]:

How about temporal patterns within a day?

In [61]:
ts = calls['timestamp']
minute_of_day = ts.dt.hour * 60 + ts.dt.minute
hour_of_day = minute_of_day / 60.

calls['minute_of_day'] = minute_of_day
calls['hour_of_day'] = hour_of_day
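The fractional-hour arithmetic above can be sanity-checked on hand-built timestamps (a quick sketch on toy data):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2018-01-01 20:30:00", "2017-08-13 00:40:00"]))
hour_of_day = (ts.dt.hour * 60 + ts.dt.minute) / 60.
# 20:30 becomes 20.5 and 00:40 becomes 0.666..., matching the table above
```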
In [62]:
py.iplot(ff.create_distplot([hour_of_day],group_labels=["Hour"],bin_size=1))

Observations?

In the above plot we see the standard pattern of limited activity early in the morning, around 6:00 AM.





Smoothing Parameters

In the above plot we see a smoothed curve approximating the histogram. This is an example of a kernel density estimator (KDE). The KDE, like the histogram, has a parameter (the bandwidth) that determines its smoothness. Many packages (Plotly and Seaborn) use a bootstrap-like procedure to choose a good value.

To understand how this parameter works we will use Seaborn, which also lets you set the parameter manually.

In [63]:
sns.distplot(hour_of_day, rug=False)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x11101d908>

Notice in the above plot that the interpolation tries to follow the histogram too closely.
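To see the effect of the bandwidth directly, scipy's gaussian_kde (used here as a stand-in, since the exact Seaborn parameter name varies by version) accepts an explicit bw_method. A sketch on synthetic "hour of day" data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic hour-of-day data: two clusters of activity
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 2, 500), rng.normal(20, 2, 500)])

grid = np.linspace(0, 24, 200)
wiggly = gaussian_kde(data, bw_method=0.05)(grid)  # small bandwidth: follows noise
smooth = gaussian_kde(data, bw_method=0.5)(grid)   # large bandwidth: may blur the two peaks
# Plot both to compare: plt.plot(grid, wiggly); plt.plot(grid, smooth)
```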

Stratified Analysis

To better understand the time of day a report occurs we could stratify the analysis by the day of the week. To do this we will use box plots.

In [64]:
## Using Pandas built in box plot
# calls['hour_of_day'] = minute_in_day
# calls.boxplot('hour_of_day', by='Day')
In [65]:
groups = calls.groupby("Day")
boxes = {i: go.Box(y=df["hour_of_day"], name=i) for (i, df) in groups}
py.iplot([boxes[d] for d in dow])
In [66]:
py.iplot(ff.create_violin(calls, data_header='hour_of_day', group_header="Day"))

Observations?

There are no very clear patterns here. It is possible Fridays might have crimes shifted slightly more to the evening. In general the time a crime is reported seems fairly consistent throughout the week.





Examining the Event

We also have data about the different kinds of crimes being reported

In [67]:
calls.head()
Out[67]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Day Lat Lon timestamp minute_of_day hour_of_day
0 18000273 VEHICLE STOLEN 01/01/2018 12:00:00 AM 20:30 MOTOR VEHICLE THEFT 1 01/24/2018 03:30:18 AM 1100 PARKER ST\nBerkeley, CA\n(37.859364, -122... 1100 PARKER ST Berkeley CA Monday 37.859364 -122.288914 2018-01-01 20:30:00 1230 20.500000
1 17092476 BURGLARY AUTO 12/12/2017 12:00:00 AM 13:30 BURGLARY - VEHICLE 2 01/24/2018 03:30:17 AM 2300 LE CONTE AVE\nBerkeley, CA\n(37.874867, -... 2300 LE CONTE AVE Berkeley CA Tuesday 37.874867 -122.263689 2017-12-12 13:30:00 810 13.500000
2 17092534 BURGLARY AUTO 12/20/2017 12:00:00 AM 05:00 BURGLARY - VEHICLE 3 01/24/2018 03:30:17 AM 1700 STUART ST\nBerkeley, CA\n(37.857495, -122... 1700 STUART ST Berkeley CA Wednesday 37.857495 -122.275256 2017-12-20 05:00:00 300 5.000000
3 17091517 THEFT MISD. (UNDER $950) 08/01/2017 12:00:00 AM 00:30 LARCENY 2 01/24/2018 03:30:11 AM 1600 CALIFORNIA ST\nBerkeley, CA\n(37.876791, ... 1600 CALIFORNIA ST Berkeley CA Tuesday 37.876791 -122.280472 2017-08-01 00:30:00 30 0.500000
4 17048102 THEFT FROM AUTO 08/13/2017 12:00:00 AM 00:40 LARCENY - FROM VEHICLE 0 01/24/2018 03:30:11 AM 2400 HILLSIDE AVE\nBerkeley, CA\n(37.867373, -... 2400 HILLSIDE AVE Berkeley CA Sunday 37.867373 -122.249515 2017-08-13 00:40:00 40 0.666667

The Offense Field

The Offense field appears to contain the specific crime being reported. As nominal data we might want to see a summary constructed by computing counts of each offense type:

In [68]:
calls['OFFENSE'].value_counts(dropna=False).iplot(kind="bar")
Out[68]:

Observations?

Car burglary and misdemeanor theft seem to be the most common crimes with many other types of crimes occurring rarely.





CVLEGEND

The CVLEGEND field provides the broad category of crime and is a good mechanism to group potentially similar crimes.

In [69]:
calls['CVLEGEND'].value_counts(dropna=False).iplot(kind="bar")
Out[69]:

Notice that when we group by the broader crime category, larceny becomes the dominant category. Larceny is essentially stealing -- taking someone else's property without force.

Stratified Analysis of Time of Day by CVLEGEND

View the crime time periods broken down by crime type:

In [70]:
boxes = [(len(df), go.Box(y=df["hour_of_day"], name=i)) for (i, df) in calls.groupby("CVLEGEND")]
py.iplot([r[1] for r in sorted(boxes, key=lambda x:x[0], reverse=True)])
In [71]:
py.iplot(ff.create_distplot([
    calls[calls['CVLEGEND'] == "NOISE VIOLATION"]['hour_of_day'],
    calls[calls['CVLEGEND'] == "DRUG VIOLATION"]['hour_of_day'],
    calls[calls['CVLEGEND'] == "LIQUOR LAW VIOLATION"]['hour_of_day'],
    calls[calls['CVLEGEND'] == "FRAUD"]['hour_of_day']
],
    group_labels=["Noise Violation", "Drug Violation", "Liquor Violation", "Fraud"], 
    ))

Examining Location information

Let's examine the geographic data (latitude and longitude). Recall that we had some missing values. Let's look at the behavior of these missing values according to crime type.

In [72]:
missing_lat_lon = calls[calls[['Lat', 'Lon']].isnull().any(axis=1)]
missing_lat_lon['CVLEGEND'].value_counts().iplot(kind="bar")
Out[72]:

Observations?

There is a clear bias towards drug violations that is not present in the original data. Therefore we should be careful when dropping missing values!

We might further normalize by the overall frequency of each crime type to find which type has the highest proportion of missing values.

In [73]:
(
    missing_lat_lon['CVLEGEND'].value_counts() / calls['CVLEGEND'].value_counts()
).sort_values(ascending=False).iplot(kind="bar")
Out[73]:




Day of Year and Missing Values

We might also want to examine the day of the year on which latitude and longitude values are missing:

In [74]:
ts = calls['timestamp']

py.iplot(ff.create_distplot([calls['timestamp'].dt.dayofyear, 
                             missing_lat_lon['timestamp'].dt.dayofyear],
                            group_labels=["all data", "missing lat-lon"],bin_size=2))

Observations?

The two distributions look pretty similar.

In [75]:
!pip install --upgrade folium
import folium
print(folium.__version__, "should be at least 0.3.0")
Requirement already up-to-date: folium in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages
Requirement already up-to-date: jinja2 in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from folium)
Requirement already up-to-date: branca in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from folium)
Requirement already up-to-date: six in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from folium)
Requirement already up-to-date: requests in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from folium)
Requirement already up-to-date: MarkupSafe>=0.23 in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from jinja2->folium)
Requirement already up-to-date: chardet<3.1.0,>=3.0.2 in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from requests->folium)
Requirement already up-to-date: idna<2.7,>=2.5 in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from requests->folium)
Requirement already up-to-date: urllib3<1.23,>=1.21.1 in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from requests->folium)
Requirement already up-to-date: certifi>=2017.4.17 in /Users/jegonzal/anaconda3/envs/data100/lib/python3.6/site-packages (from requests->folium)
0.5.0 should be at least 0.3.0
In [76]:
import folium
import folium.plugins # The Folium Javascript Map Library

SF_COORDINATES = (37.87, -122.28)  # Berkeley city center, despite the variable name
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
locs = calls[['Lat', 'Lon']].astype('float').dropna().values  # .as_matrix() is deprecated
heatmap = folium.plugins.HeatMap(locs.tolist(), radius=10)
sf_map.add_child(heatmap)
Out[76]:

Questions

  1. Is campus really the safest place to be?
  2. Why are all the calls located on the street, and often at intersections?
In [77]:
cluster = folium.plugins.MarkerCluster()
for _, r in calls[['Lat', 'Lon', 'CVLEGEND']].tail(1000).dropna().iterrows():
    cluster.add_child(
        folium.Marker([float(r["Lat"]), float(r["Lon"])], popup=r['CVLEGEND']))
    
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
sf_map.add_child(cluster)
sf_map
Out[77]: